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Abstract 

In this paper, we develop an expUcit formula allowing to compute the first k moments 
of the random count of a pattern in a multi-states sequence generated by a Markov source. 
We derive efficient algorithms allowing to deal both with low or high complexity patterns 
and either homogeneous or heterogenous Markov models. We then apply these results to 
the distribution of DNA patterns in genomic sequences where we show that moment-based 
developments (namely: Edgeworth's expansion and Gram-Charlier type B series) allow 
to improve the reliability of common asymptotic approximations like Gaussian or Poisson 
approximations. 

1 Introduction 

The distribution of pattern counts in random sequence generated by Markov source have many 
applications in a wide range of fields including: reliability, insurance, communication systems, 
pattern matching, or bioinformatics. In this particular field, a common application is the sta- 
tistical detection of pattern of interest in biological sequences like DNA or proteins. Such ap- 
proaches have successfully led both to the confirmation of known biological signals (PROSITE 
signatures, CHI motifs , etc.) as well as the identification of new functional patterns (regulatory 
motifs in upstream regions, binding sites, etc.). Here follows a short selection of such work: 
[20, 37, 8, 13,3, 15, 19, 22]. 

From the statistical point of view, studying the distribution of the random count of a pattern 
(simple or complex) in a multi-states Markov chain is a difficult problem. A great deal of ef- 
forts have been spent on this problem in the last fifty years with many concurrent approaches 
and we give here only few references (see [32, 24, 28] for more comprehensive reviews). Exact 
methods are based on a wide range of techniques like Markov chain embedding, moment gen- 
erating functions, combinatorial methods, or exponential families [16, 35, 1, 9, 7, 27, 36, 6]. 
There is also a wide range of asymptotic approximations, the most popular among them being: 
Gaussian approximations [30, 10, 21, 31], Poisson approximations [18, 17, 33, 14] and Large 
deviations approximations [12, 26]. 

More recently, the connexion between this problem and the pattern matching theory have 
been pointed out by several authors [25, 1 1, 23, 29, 34]. Thanks to these approaches, it is now 
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possible to obtain an optimal Markov chain embedding of any pattern problem through minimal 
Deterministic Finite state Automata (DFA). In this paper, we want to apply this technique to 
the exact computation of the first k moments of a pattern count in a random sequence generated 
by a Markov source. Our aim is to provide efficient algorithms to perform these computations 
both for low and high complexity patterns and either considering homogeneous Markov model 
or heterogeneous ones. 

The paper is organized as follow. In a first part, we recall the principles of optimal Markov 
chain embedding through DFA. We then derive from the moment-generating function of the 
random pattern count a new expression for its first k moments, and introduce three different 
algorithms to compute it. The relative complexity of these algorithms in respect with previous 
approaches are then discussed. Finally, we apply Edgeworth's expansion and Gram-Charlier 
type B series techniques to obtain near Gaussian or near Poisson approximations and show 
how this allows to improve the reliability of classical asymptotic approximations with a modest 
additional cost. 

2 DFA and optimal Markov chain embedding 

2.1 Sequence model 

Let {Xi)i^i^i be a order d ^ Markov chain over the cardinal s ^ 2 alphabet A. For all 
1 ^ « ^ i ^ ^, we denote by Xf — Xi. . . Xj the subsequence between positions i and j. For 
all af ai .. .Ud e A'^,b e A, and! ^ i ^ e - d, let us denote by u (af) = P {Xf = af) the 
starting distribution and by 7ri+d(af , b) = F(Xi^d — b\Xl'^'^~^ — af) the transition probability 
towards Xi_^_d- 

2.2 Pattern count 

Let W be a finite set of words (for simplification purpose, we assume that W contains no 
word of length smaller or equal to d) over A. We consider the random number N of matching 
position of W in Xf defined by 



where S{Xl) is the set of all the suffixes of XI and where is the indicatrix function of event 
A. 

2.3 DFA 

As suggested in [25, 11, 23, 29], we perform a optimal Markov chain embedding of the prob- 
lem through a DFA. We use here the notations of [29]. Let {A, Q, cr, T , S) be a minimal DFA 
that recognize the language A*yV {A* denote the set of all - possibly empty - texts over A) of 
all texts over A ending with an occurrence of W. Q is a finite state space, cr e Q is the starting 
state, JF c Q is the subset of final states, and 5 : Q x ^ — > Q is the transition function. We re- 

def 

cursively extend the definition of b over Q x ^* thanks to the relation aw) = S{6{p, a),w) 
for all p G Q, a G A, w G A* . We additionally suppose that this automaton is non rf- ambiguous 
(a DFA having this property is also called a d-th order DFA in [23]) which means that for all 
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q E Q, 6^'^{p) = {af e 3p e Q, 5 (p, a^) = g} is either a singleton, or the empty set. 
When the notation is not ambiguous, 5~'^{p) may also denotes its unique element (singleton 
case). 

2.4 Markov chain embedding 

def — def 

Theorem 1. We consider the random sequence over Q defined by Xq = cr and Xi = 6 , Xi] 
Vi, 1 ^ i ^ Then {Xi)i^cL is a heterogeneous order 1 Markov chain over Q! = 6{s, A'' A*) 
such as, for all p,q E Q! and 1 ^ i ^ £ — c/ the starting distribution = P i^X^ = and 

the transition matrix Ti^^iP, q) = P (^^i+d = q\^i+d-i = are given by: 

9) = { ' ^ = (3) 

Proof. The result is immediate considering the properties of the DFA. See [23] or [29] for more 
details. □ 

2.5 Moment generating function 

Corollary 2. The moment generating function f{y) of N is given by: 

+00 n-d \ 

f{y) ^ P (TV = n) = /X, n ^P^+<^ + yQ^+<^) ^ 

n=0 \i=l / 

where 1 is a column vector of ones (in the same manner, we denote by is a column vector of 

def 

zeros) and where, for all 1 z £ - rf, Ti+d = Pi+d + Qi+d with Pi+d{p, q) = \iTTi+d{p, q) 

def 

and Qi+d{p, q) = lqej^Ti+d{p, q) for all p, g e Q'. 

Proof. Since Qj+d contains all counting transitions, we keep track of the number of occurrence 
by associating a dummy variable y to these transitions. We hence just have to compute the 
marginal distribution at the end of the sequence and sum up the contribution of each state. See 
[25, 1 1, 23, 29] for more details. □ 

Corollary 3. In the particular case where (Xi)i^i^^ is a homogeneous Markov chain we can 
drop the indices in Pj+d and Qi+d and Equation (4) simplifies into 

f{y) = i^d{P + yQf-''^. (5) 

Corollary 3 can be found explicitely in [23] or [34] but its (however straightforward) gen- 
eralization to heterogenous model (Corollary 2) appears to be a new result. 



3 



3 Main result 



Lemma 4. For all /c ^ we have 

= fc!/xJ A{n,.....}iym (6) 

\l^n<...<ifc^^-d / 

where for all I cN, Aij{y) = Pi+a + vQi+d ifi^I and Aij{y) = Qi+a ifiel. 

Proof. The lemma is obvious for A; = 0. We assume now that the lemma is true at fixed 

rank k. When derivating Equation (6), the key is then to see that for all / C N, A'^j{y) = 
Aijij{j}{y)- For each configuration / = {ii, .... ik+i}, it is hence obvious that Aij{y) 
appears in A', j^^^y for all j e /. This explains the A; + 1 factor which is combined to A;! to 
establish the lemma at rank k + 1. □ 

Theorem 5. For all A; ^ we have 

E ( (]^^^) = with g{y) = ^ (T,+, + yQ,+,) j 1 (7) 

and where [g{y)]yk denotes the coefficient of degree k in g{y). 

Proof. By derivating k times the moment generating function / we easily get E[A^! / ( A^— A;) !] = 
/('^^(l). Expanding the expression of g{y) at degree k then allows to identify the right term in 
Equation (6) for y = 1 thus proving the theorem. □ 

Corollary 6. In the particular case where is a homogeneous Markov Equation (7) 

simplifies into 

^ {-(]/^) = ^■l9iy)]y'' with g{y) = i^d{T + yQf-'l. (8) 



4 Three algorithms 
4.1 Full recursion 

For all 1 ^ i ^ ^ — d we consider column polynomial vector defined by 

def 

If we denote now by Ek{i) = [Ei{y)]yk its coefficient of degree k for all k ^ 0, then it is clear 
that we can rewrite the expression of g(y) in Equation (7) as [g(y)]yk = fidEk{l). 

Proposition 7. We have the following results for all 1 ^ i ^ £ — d: 

i) Eo{t) = 1; 

ii) Ei{£ - d) = Qel; 

iii) if A; ^ 1 and (£ - d - i + 1) < ^ then Ek(i) = 0; 

iv) if A; ^ 1 and i < ^ - d then Ek{i) = Ti+dEk{i + 1) + Qi+dEk-i{i + 1). 

Proof, i) It is clear that Eo{i) — (11^=1 ^i+<i)l which is equal to 1 since all Tj^d are stochastic 
matrices; ii) immediate; iii) the product must contains at least A; terms to have degree A; contribu- 
tion; iv) is easily proved by recurrence using the fact that Ei (y) — (Tj+d + yQi+d) Ei+i (y) . □ 
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Require: The starting distribution /x^, matrices Tj and Qi for aWl ^ i ^ i — d, and aO{kxL) 
workspace to keep the current values of Ej{i) for ^ j ^ k, where L denotes the cardinal 

of Q'. 

Initialization: 

Eo{i -d) = l, Ei{e -d) = Qel, and Ej{i - d) = for 2 ^ j ^ k. 

Recursion: 

fori = l)..ldo 

for j = /c. l do 

Ej{i) = Ti+dEj{i + 1) + Qi+dEj-i{i + 1) 

end for 
end for 

Output: for all ^ j ^ k, [g{y)]yj = l^dEjil) 

Algorithm 1: Compute the k first terms of g{y) in the most general case by performing a full 
recursion. The workspace complexity is 0{k x L) and since all matrix vector product exploit 

the sparse structure of the matrices, the time complexity is 0{£ x k x s x L) where s x L 
corresponds to the maximum number of non zero terms in Tj+d. 

4.2 Direct power computation 

From now on, we consider the particular case where the Markov model is homogeneous. Ac- 
cording to Equation (8) the expression of g{y) in such a case is then simplified into g{y) = 

fid{T + yQY^'^1. If we denote by Mi{y) =^ [(T + yQy]yQ..k our problem is then only to 
compute Me_d{y) since [g{y)]yj = [iJdMe-d{y)l]yj for all ^ j ^ k. 

Proposition 8. We have 

J 

M,_,(y) = J]M2.(y)"K=i} (10) 

j=0 

where l-d^ ao2° + ai2^ + ... + aj2^ with G {0, 1} for ^ j ^ J = [loggl^ - d)\ 
(Vx e R, \_x\ denotes the largest integer smaller than x). 

Proof. Immediate. □ 

Since we only need to compute the terms of degree smaller than k in Mg_j{y) to obtain the 
first k moments of A^, we can speed up the computation by ignoring terms of degree greater 
than k in Equation (10). We hence obtain Algorithm 2 where Tk\p{y)] denotes the truncated 
polynomial obtained from p{y) by dropping all terms of degree greater than k. 

4.3 Partial recursion 

In this particular section, we assume that T is an irreducible and aperiodic matrix and we denote 
by u the magnitude of its second eigenvalue when we order them by decreasing magnitude. 

def 

For alH ^ we consider the polynomial vector Fi{y) — [T + yQYl, and for all /c ^ we 

def 

denote by Fk{i) ~ [Fi{y)]yk the term of degree k in Fi{y). By convention, Fk{i) = if i < 0. 
It is then possible to rewrite the expression of g{y) in Equation (8) as [g{y)]yk — iJ.dFk{i — d). 

Additionnaly, let us finally define recursively the quantity Dj{i) for all k, i,j ^0 by Dl{i) = 
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Require: The starting distribution /x^, matrices T and Q, £, d, and 0{k x L'^ x J) for M2j (y) 
for ^ j ^ J and a polynomial matrix M(y). 
Preliminary computations: 

perform the binary decomposition £ — d — ao2° + . . . aj2'^ 
M2o{y) = {P + yQy 
for j = 1..J do 

M2j{y)^Tk[M2j-i{yy] 
end for 

Computing Me^diy)- 
M{y) = Mo{y) 
for j = 0..J do 

if aj = 1 then M(y) = [M(y) x M2. (z/)] 
end for 

Output: for all ^ j ^ k, [g{y)]yj = [jidMe-d{y)i]yj 

Algorithm 2: Compute the k first terms of g{y) in the particular case of a homoge- 
neous Markov model through a direct power computation. The workspace complexity is 
0{k X X log2^) and the time complexity is 0{k'^ x x log2^) (k'^ for the polynomial 
products and for the matrix products). 

Fk(i) and, if i ^ 1 and j ^ 1, Dlii) = Di-\i) - D{-\i - 1) so that 

Dm = i2^~lY(l)F,{t-S). (11) 

Lemma 9. We have the following initial conditions: 

i) Vz^O,L'0(z) = l 

ii) Vj ^ 1, L'^(i) = (-1)^(^-^)1 if ^ i ^ i - 1, and dUi) = if i ^ j 

iii) VA; ^ 1, ^0(0) = 0, and Dl(i) = TDl{i - 1) + QDl_S - 1) for i ^ 1. 
And for all /c, j, i ^ 1 we have the following recurrence relations: 

a) Di{i)^Di-'{i)-Di-\i-l) 

b) Di«=TDi(^-l) + gDt,(^-l) 

Proof, i) It is clear that -Do(«) = T'l = 1 since T is a stochastic matrix; ii) consequence 
of i) and Equation (11); iii) is proved by recurrence; a) is simply the definition of Dl(i); b) 
consequence of iii) and of the recursive definition of L>^(i). □ 

Lemma 9 provides an efficient way to compute all Dl(i) for ^ k,j ^ K and Q ^ i ^ a 
(see Algorithm 3). However, these computations suffer numerical instability in floating point 
algebra. This phenomenon is emprically studied in section 5.3. 

Lemma 10. For all A; ^ 1 we have: 

i) D'ki^ = Vj=kT'-'QDti{j - k) fori ^ k; 

ii) 3Cfc e such as (i) = + 0{kv'l^) and Dl+^{i) = + 0{kv'/^) for all i ^ 2k. 
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Proof, i) is a direct application of Lemma lb). For A; = 1, i) simply gives D\{i) = ^Ql 
which proves ii) for A; = 1. We assume that ii) is true for some fixed rank k and then decompose 

DlX\{^ = ( T--^QDl-^\3 - - 1) j + J] T'-^QDl-^\j -k-l) (12) 

\j=A;+l / j=a+l 

^ V ' " V ' 

A B 

for some a ^ 2k. Thanks to the stochasticity of T, 3C^+i e MP such as A = C^+i + 
and since ii) is true at rank k, B — Yl]=a^{kv^/^). Elementary analysis proves 

that min^ + X]j=a ^^*''^'^| —0{{k-\- the minimum being obtained for a — 

i{k — l)/k. ii) it then proved at rank A; + 1 with C^+i = C^+i that particular a. □ 



Proposition 11. For all /c ^ 1 and ^ j ^ k and for any i ^ 2k 

k-j ^ 



j'=0 

and in the particular case where j = we get 

k 



F,{^)^F,{a) + J2^^ ^, («) + O r)""^') " ^^^^ 

Proof. A simple application of Lemma lOii) proves that D^{i) = D^{a) + 0{v'^/^) which is 
exactly Equation (13) for j = k. We then obtain the result for j < A: by recurrence and the fact 
that Di{{) = Diia) + Yj,=^^, D^^\^') and that Ej^^.+i = (j^l)- 

□ 



Require: The matrices T and Q, a value a ^ K, and a 0{K'^ x L) workspace to keep the 
current value of -D^(i) and D^i — 1) for all ^ k,j ^ K 
for i — 0..a do 
Initialization: 

D'oi^ = 1 

for j = 1. .K do D^Q{i) = (-1)^(^7^)1 if ^i^j- 1, andD^(i) = Oifi > j endfor 
for k = 1..K do Dl{i) = if i = 0, and Dl{i) = TDl{i - 1) + Q£>^_i(i - 1) if i ^ 1 
endfor 

end for 

Recursion: 

for k — 1..K and J = 1..K do 

update Di{i) either with £>r'(0 - Di~\i - 1) or rL>^(i - 1) + QDi_,{i - 1) 
end for 

Algorithm 3: Compute Dl.{a) for all ^ k,j ^ K. The workspace complexity is 0{K'^ x 
L) and since all matrix vector product exploit the sparse structure of the matrices, the time 
complexity is 0(q; x i^T^ x s x L). 
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4.4 Comparison with known methods 



Up to our knowledge, there is no record of method allowing to compute order k moments of 
pattern count in heterogeneous Markov sequences. This work was in fact initially motivated 
by this observation. In the homogeneous case however, many interesting approaches can be 
found in the literature. In most case, these methods are limited to the computation of the first 
two moments, but several of them can be also used to get arbitrary order moments like with our 
method. 

One of these approaches consist to consider the bivariate moment generating function 



where Ni is the random number of pattern occurrences in a sequence of length £. Thanks to 
Equation (5) it is easy to show that 



where / denotes the identity matrix. It is then possible to get order k moments of Ne using the 
relation: 



Such interesting approach have been developed by several authors including [25] and [23]. In 
order to apply this method, one should first use a Computer Algebra System (CAS) to perform 
the bivariate polynomial inversion of matrix I — z{P + yQ) to get f{y, z) thus resulting in a 
complexity 0{L^) where L is the number of states in the embedding Markov chain. One hence 
needs to compute the order k partial derivative in y of f{y, z) prior to to perform (fast) Taylor 
expansion of the result up to z^. The resulting complexity is ©(logjf x D^) where D is the 
degree of the denominator in d''f/dy''{l, z). Like in Algorithm 2 we get a cubic complexity 
with for linear algebra computations, and a logarithmic complexity with i thanks to the 
binary decomposition. However, this method is much more sophisticated to implement (CAS 
against simple manipulation of polynomial matrices) and the D"^ term that appears in the Taylor 
expansion complexity hide in fact at least a cubic complexity in k which is not easy to handle. 
Let us note that [25] also suggests to obtain asymptotic development of moments by computing 
only the local behaviour of the generating function f{y, z) which allows computation to be 
performed in faster floating point arithmetic. However, this approach can not gives the exact 
moments but only approximations, and one still require to perform the formal inversion of an 
order L bivariate polynomial matrix which is an expensive step. 

More recently, [34] suggested to compute full bulk of the exact distribution of Nn through 
Equation (5) using a power method like in Section 4.2 with the difference that all polynomial 
products are performed using Fast Fourier Transform (EFT). The drawback EFT polynomial 
products is that the resulting coefficient are known with an absolute precision equal to the 
largest one times the relative precision of floating point. As a consequence, the distribution is 
well computed only in its center part. Fortunately, this is precisely the part of the distribution 
that matters for moment computations. Using this approach, and a very careful implementation, 
one can then compute the full distribution with a complexity 0(L^ x log2 (. x rimax log2 rimax) 
where is the maximum number of pattern occurrences in the sequence. Once again, the 
resulting complexity is likely to be much higher that the one of Algorithm 2 since k'^ is usually 
far smaller than nmax log2 ^max- Moreover, Algorithm 2 is again much easier to implement than 
this sophisticated EFT approach. 




(15) 



f{y,z)^z''xyia{I-z{P + yQ))-^l 



(16) 




(17) 



8 



Finally, one should note that both these two known approaches involve a complexity O (L^) 
in time (and at least 0{L'^) in memory) which makes difficult or even impossible to use them 
for moderate or high complexity patterns (ex: L = 100 or L = 1000). For such patterns, 
Algorithm 1 appears to be a safe but slow alternative (linear complexity with sequence length £) 
and Algorithm 3 seems to be a very promising approach since it allows to handle such complex 
patterns while retaining a logarithmic complexity with i like in Algorithm 2. Unfortunately, 
the numerical instabilities observed in practice with Algorithm 3 need to be investigated further 
before to trust this approach. 

5 Application to DNA patterns in genomics 

5.1 Dataset 

We consider the a order d — 1 homogeneous Markov model over A — {A, C, G, T} which 
transition matrix estimated over the complete genome of the bacteria Escherichia, coli is given 
by: 

/ 0.30 0.21 0.22 0.27 \ 
0.23 0.23 0.33 0.22 
^~ 0.28 0.29 0.23 0.20 
\ 0.19 0.28 0.23 0.30 / 

We consider a sequence X — Xi . . . of length i — 400 000 and starting with Xi — A. 

5.2 Some moments 

In this section, we compute the first k — A moments of several DNA patterns. We then use 
these moments to compute: 

expectation m = mi, standard deviation a = \fm^ 

skewness 71 = mz/rrd/'^ , and excess kurtosis 72 — 1114/ ml — 3 

where rrii = K[{N — mi)*] is the centered moment of order i. A negative (resp. positive) 
skewness indicates that the mass of the distribution is concentrated on the right (resp. left) side 
of the expectation. A skewness of zero indicates a balanced distribution. A negative (resp. 
positive) excess kurtosis indicates that the distribution is more flat (resp. more peaked) than the 
Gaussian distribution. A Gaussian distribution has a excess kurtosis of zero. 

On Table 1 we can see the value of these quantities for several DNA patterns. For the first 
three simple patterns, we can see how the additional information off skewness and excess kur- 
tosis gives us a better description of their distribution. For example, we know from theory that 
highly overlapping patterns are distributed according to compound Poisson approximations. 
This is exactly why we observe an increasement of skewness and kurtosis from Pattern GCTGGT 
(non-overlapping) to Pattern GGGGGG (highly self-overlapping). 

If we consider now the more complex patterns of the second part of Table 1 we can observe 
how the running time of Algorithm 2 quickly increases with L. This is obviously not a surprise 
since we expect a cubic complexity in this parameter with this approach. One should however 
note that it is nevertheless possible to deal with moderately complex patterns like GNNGNNGG 
which contains in fact a total of 4"^ = 256 simple patterns. Another interesting observation is 
that both skewness and kurtosis get closer to zero when we add more symbol N into the pattern. 
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Table 1 : First four moments of several DNA patterns computed through the power algorithm 
(running time indicated in seconds). The background model is the order d = 1 homogeneous 
Markov mo del defined in section 5.1 and the sequence length isi — 400, OOP. 
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8.364 
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0.01413 


0.09 


AGAGAG 


9 


84.89 


9.791 


0.12780 


0.01903 


0.09 


GGGGGG 


9 


65.91 


10.260 


0.20290 


0.05363 


0.09 


GCTGGTGG 


11 


3.782 


1.945 


0.51420 


0.26430 


0.11 


GCTGGNGG 


14 


20.79 


4.559 


0.21920 


0.04801 


0.11 


GNTGGNGG 


21 


79.55 


9.014 


0.11570 


0.01390 


0.49 


GNTGNNGG 


28 


340.1 


18.680 


0.05628 


0.00331 


1.10 


GNNGNNGG 


63 


1508.0 


42.290 


0.03283 


0.00136 


15.80 



This is due to the fact that adding more N makes the pattern more frequent (this can be seen 
with the geometrically increasing expectation) and that Gaussian approximations for pattern 
problem are well known to work better for frequent patterns. 

5.3 Numerical stability of the partial recursion 

On Figure 1 we study empirically the convergence of towards by computing | | 

for several k through Algorithm 3. We consider here three way of updating Dl (i) : by using only 
through Dl~'^{i) - D{7^{i - 1) (Red curve); by using only through TD{{i - 1) + QD{_-^{;l - 1) 
(Blue curve); or by taking the update which displays the smallest norm (Black curve). If these 
three alternative approaches give similar results when | |i5^+^(i) 1 1^ ^ 10~^^ differences start 
to appear for smaller values. The differential recurrence relation (Red curve) quickly start to 
accumulate machine precision residuals and results in noisy curves with a slow increasement. 
When using the matrix recurrence relation (Blue curve) a similar problem arise, however ap- 
pearing slightly later and with far less noise. Surprisingly, the last approach which combine the 
two updating methods at each step benefits from a synergistic effect and displays a far better 
stability. A similar behaviour have been observed for a wide range of tested patterns (data not 
shown). 

5.4 Near Gaussian approximations 

Gaussian approximations for random pattern counts are widely used in the literature. We want 
here to push forward this idea by taking advantage of higher order moments to get near Gaus- 
sian approximations. This well known technique is described in details in Appendix B. 

We can see on Figure 2 the relative error (in log-scale) of several Edgeworth's approxi- 
mations for the distribution of pattern GCTGGT. The solid line shows the reliability of plain 
Gaussian approximation (which correspond to an order s = Edgeworth's expansion). Un- 
surprisingly, this approximation works better around the expectation (E[A^] = 70.09 according 
to Table 1) providing two exact digits on the range [54; 85], and one exact digit on the range 
[50; 92]. Beyond these limit, we get too far in the tail distribution to get reliable results. This 
behaviour is exactly what we expect from the central limit theory. 
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Figure 1 : Plot of log^o 1 1 D^-^^ (i) 1 1 ^ (|/-axis) for 1 ^ A; ^ 9 (from left to right), and 1 ^ i ^ 100 
(s-axis) for the pattern W = GNTGNNGG over the DNA alphabet A = {A, C, G, T} (N symbol 
meaning "any letter") using a order d = 1 Markov model. The curves are obtained through 
Algorithm 3 using recurrence relation Lemma : a) only (Red curve); b) only (Blue curve); 
a) and b) keeping the Dl{i) displaying the smallest norm (Black curve). All missing values 
correspond to | |D^+^(i) 1 1 =0. 
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Edgeworth's Expansion 




n 



Figure 2: Relative error in decimal log scale of Edgeworth's expansion of order s = (Red- 
solid), order s = 3 (Blue-dotdashed), and order s = 5 (Black-dashed) for Pattern GCTGGT on a 
order 1 homogeneous Markov model (parameter estimated on the complete genome of E. coli) 
of length £ = 400 000. 
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If we consider now order s = 3 Edgeworth's expansion (that uses moments up to order 
A; = 5) depicted with a dotdashed line on Figure 2, we see a dramatic improvement both on the 
accuracy of the approximation (up to 6 exact digits) and on the range of reliability (at least one 
exact digit on [28; 118]). We can even get a further improvement by considering order s = 5 
expansion (dashed line) which uses moments up to order k = 7. In both case however, the 
reliability of these approximations decreases dramatically when we get far enough in the tail 
distributions. 

We observe a very similar behaviour for Pattern AGAGAG and Pattern GGGGGG and the corre- 
sponding figures are hence not shown to save space. 

Thanks to this work we see that for a modest additional cost (computing moments up to 
order k — 5 or k — 7 instead of simple first and second moments), one can dramatically 
improve the reliability of Gaussian approximations for pattern problems. 

5.5 Near Poisson approximations 

A very common alternative to Gaussian approximations for random pattern counts is to turn 
towards Poisson approximations. These approximations are known to be quite accurate for non- 
overlapping patterns, but also to fail for highly self overlapping patterns for which compound 
Poisson approximations are known to perform better. We want here to evaluation the interest 
of near Poisson approximations provided by the Gram-Charlier Type B series described in 
Appendix C. 

For the non-overlapping pattern GCTGGT, we can see on Figure 3 that the plain Poisson 
approximation (order s = Gram-Charlier Type B series) gives already very good results 
with at least one exact digit on all the distribution, and up to 4 or 5 of them in the region 
close to the expectation. This interesting result is dramatically improved by the order s = 4 
approximations which gives at least 4 exact digits on all the considered range and more that 8 
exact digits around the expectation. Surprisingly, the order s = 8 approximation is less reliable 
than the previous one, and gives even worse results that the plain Poisson approximation in the 
tail distributions. This is due to the fact that the coefficients computed according to Equation 
(27) accumulate large terms that compensate each other. This is a typical scenario for large 
relative errors in floating point arithmetic. One can solve this problem either by performing 
computations with an arbitrary number of digits (usually slow=), or one can explicitly compute 
the expected relative error with the current machine-precision and renounce to use unreliable 
coefficients. 

If we consider now the self-overlapping pattern AGAGAG, we know from theory that Poisson 
approximations are not supposed to perform well. This is the reason why we observe on Figure 
4 that the plain Poisson approximations only works on a very limited range the distribution 
(roughly on [69; 103]). Once again however, order s = 4 or s = 8 Gram-Charlier expansion 
dramatically improve the reliability of the approximations getting up to 6 exact digits close 
to the expectation and at least one exact digits on a much wider range (up to [24; 150] for 
order s = 8). One should note that in this case, the numerical issue observed for high order 
approximations for the previous pattern does not occur. We get a very similar result for the 
even more self-overlapping pattern GGGGGG and the corresponding figure is then omitted to 
save space. 

Like with near Gaussian approximations, we see that near Poisson approximations can 
dramatically improve the reliability of Poisson approximations for a very modest cost (ex: 
computing moments up to order k — A or k — S). 
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Gram-Charlier Type B Series 




50 100 150 



Figure 3: Relative error in decimal log scale of Gram-Charlier type B approximation of order 
s = (Red- solid) to order s = 4 (Blue-dotdashed) to order s = 8 (Black-dashed) for Pat- 
tern GCTGGT on a order 1 homogeneous Markov model (parameter estimated on the complete 
genome of E. coli) of length ^ = 400 000. 
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Gram-Charlier Type B Series 




n 



Figure 4: Relative error in decimal log scale of Gram-Charlier type B approximation of order 
s = (Red- solid) to order s = 4 (Blue-dotdashed) to order s = 8 (Black-dashed) for Pat- 
tern AGAGAG on a order 1 homogeneous Markov model (parameter estimated on the complete 
genome of E. coli) of length ^ = 400 000. 
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6 Conclusion 

In this paper, we have derived from the explicit expression of the mgf of a pattern random 
count N, a new formula allowing to compute a arbitrary number k of moments of N. We also 
have introduced three efficient algorithms to perform this computation. The first one allow 
the computation of pattern count moments of arbitrary order in the framework heterogeneous 
Markov model which is a completely new result (up to our knowledge). The second algo- 
rithm, suitable for homogeneous models and low complexity patterns, appear to have a better 
or similar complexity to state-of-the art known algorithms but with a far much simpler imple- 
mentation. Finally, the third algorithms uses partial recursions exploiting the sparse structure 
of the transition matrix to provide a logarithmic complexity with the sequence length even 
for high complexity patterns. This very promising approach however suffers from numerical 
instabilities in floating point arithmetic that need to be further investigated. 

One should note that our main result can be easily extended to mixed moments of several 
pattern counts. In order to save space, we give here such as result only for the particular case of 
two patterns Wi and W2 in a homogeneous model. We assume that the final states of or DFA 
could be partitioned into T = T\\J T2 such as T\ (resp. T-}) count the number N\ (resp. iV2) 
of occurrences of Wi (resp. W2). This is always possible by duplicating states. We consider 



and we then have /(yi, 7/2) = H-dkP + y\Q\ + ViQ-i)^ ''l- By introducing now g{yx, ^2) = 
y^d {T + yiQi + y2Q2Y~'^ 1 we get for anyki,k2^0 that: 



As an application, we have considered the distribution of DNA patterns in genomic se- 
quences. In this particular framework, we have shown how order k = 3 and k — A moments 
allow to get a better description of the distribution (with quantities like skewness and excess 
kurtosis). We have also considered moment-based approximations namely Edgeworth's expan- 
sion (near Gaussian approximations) and Gram-Charlier Type B series (near Poisson approxi- 
mations). For both approximations, we have seen how the additional information provided by 
a couple of higher order moments can dramatically improve the reliability of these common 
approximations. As a perspective, it seems to be very promising to develop near geometric or 
compound Poisson distribution with Gram-Charlier Type B series. 



A Moments and cumulants 

def 

For any random variable X and for any A; ^ we define the following quantities: Qk ~ 
1/A;!E [X\/{X — fc)!] the coefficient of degree k in the polynomial g{y) defined in Section 3; 

m'j. ^ 'E[X^) the moment of order k; rrik = E[{N — m'^)*^] the centered moment of order k; 

and Kk the cumulant of order k defined by h{t) = logE(e*'^ ) = J2k>i '^kit^ /k\). Cumulants 
and moments are connected through the following formula: 




(18) 




(19) 



APPENDIX 




(20) 
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Using this formula we get: ni — E(X) and k,2 — m2 — ^3 = ms, and k,^ — 1714 — 3m|. 

The skewness 71 and excess kurtosis can be expressed from cumulants: 71 = K3/k2^^ and 
72 = ka/kI. 



B Edgeworth's expansion 

This is directly taken from [5] except the explicit order 5 expansion given in Equation (24) 
which is a new contribution (only order 3 explicit expansions seems to be available in the 
literature). 

Let X be a centered random variable (E[X] = 0) that admit finite moments of all orders (we 

denote by cr^ the variance of X), let $ defined by $(i) = E[e*^] (where i denote the imagmary 
complex number) be its caracteristic function. Let ip be the caracteristic function of X/a, we 
have ip{t) = ^{t/a). The definition of cumulants (see Appendix A) then allows to write the 
expansion: 

00 

log0(t) = log^t/a) ~ Yl -^^^f (21) 

k=2 




then by denoting Sk 

fr + 2)!^" 

r=l 

The Fourier transform of expansion (22) then gives: 



(22) 



def 

where q{x) — ap{ax) is the probability distribution function (pdf) of X/a (j){x) being the 
pdf of X), where Z{x) = exp(— a;^/2) / v^27r is the pdf of a standard Gaussian variable, where 
{km}s is the set of all non-negative integer solution of the Diophantine equation ki + 2k2 + 
. . . + skg = s, r = ki + k2 + ■ ■ ■ + kg, and where Hk{x) are the Hermite polynomials defined 
recursively by Hq{x) = 1 and Hk{x) = xHk_i{x) — H'f^_^{x) for all k ^ I. 

Here are the sets of {km}s for 1 ^ s ^ 5: {km}i = {1}, {km}2 = {20,01}, {/c^js = 
{300, 110, 001}, {km}4 = {4000, 2100, 0200, 1010, 0001}, and {km}5 = {50000, 31000, 12000, 
20100, 01100, 10010, 00001}, and here is the explicit expression of (23) up to order s = 5 (such 
an explicit expression can be found up to s = 3 in [4]): 

~ 1 + cr<^ ^3(2;)^ 



Z{x) ~ {"''31 

+ (h,{x)^ + H,ix)S] + l^5(x) t + H,ix)^ + H^ixS 



4! '2!3!2 J [ ' 5! " ' 3!4! ' 

+ <J < H,{x)— + H^{x) — — + — — + H^r{x) —--J + 



7! ' V 4!5! 3!6! / ' \2\3\^5\ 2\3W.^ 
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C Gram-Charlier type B serie for near Poisson distribution 

This is initially taken from [2] but we derive new recurrence relation that are more adapted to a 
modem computational framework than the explicit (and sometimes erroneous) formulas given 
in the original article. 

Let z/'(i) = e~^y/i\ be the pdf of a Poisson distribution of parameter A, and let A be the 

def 

differential operator defined by Aip{i) = ip(i) — ipii — 1). Our objective is to approximate the 
pdf F of a discrete non-negative random variable X with 

F(i)~^c,-A^VW (25) 

3=0 

In order to do so we use a moment method and find a solution (cq, ci , . . . , Cg) of Yl]=o ^j^l (^) ~ 

¥.[X^] for all ^ A; ^ s with P^(A) = E^o i^^^i^i^) for all j, A; ^ 0. 

It is clear that we have Pq{X) — 1, and we have the following recurrence relation for all 



Pf(A) + f (A) 



and Pl^\\)^-^{X). (26) 



We hence get that cq = 1 and we derive the following recurrent relation for A; ^ 1: 

k-l 



Please note that P^{X) is always a scalar. If we now denote by gk = 1/A;!E [X\/{X - A;)!] the 
we can show by recurrence for all A; ^ 1 that we finally have: 



Ck 



A;! - ^ {k-j)\ 



Here are the explicit first 5 terms of this formula: 

al , gf ,9lg2 gf 

C2 = 5/2 - Y C3 = -gs + gig2 -— C4 = 5/4 - gm + ^ 

9I93 , 9^92 g\ . 9I94 gfgs . 9^92 gf 
C5 = -^5 + ^1^4- ^ + 3^ ce-ge-g,g, + -^ 6" + ^"l44- 
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